Johannes Bock, November 30, 2016
In the following analysis I will conduct some exploratory data analysis on the Prosper Loans data set.
According to their website “Prosper is America’s first marketplace lending platform, with over $7 billion in funded loans. Prosper allows people to invest in each other in a way that is financially and socially rewarding.” See also the Prosper Homepage.
The data set at hand contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information. The dataset contains loans created between 2005-Q4 and 2014-Q1.
Variable Definitions for the Prosper Loans Dataset can be found here: Variable Definitions.
For the following analysis I will focus on 24 variables, which I pre-selected based on my subjective assumptions. I expect that those variables will reveal some interesting insights about Prosper loans.
Before starting with the analysis of the data set, I decided to first have a look at the missing data. From the missing data map above, I can conclude that the data set only has missing data for a few variables. Especially, the variables TotalProsperLoans and ProsperPaymentsOneMonthPlusLate have a lot of missing records. In both cases it is likely that missing records are equivalent to no previous loans and no late payments respectively. There are missing values for ClosedDate, since the data set contains loans which are still active. Moreover, EstimatedReturn and ProsperScore was only calculated for loans that originated after July 2009.
In the following section I will look at the distributions, frequencies and key statistics of one variable at a time.
First, I decided to focus on the borrowers and see who is using the platform and why are they using it.
The visualization above clearly shows that most borrowers on Prosper come from California. This is not surprising for two reasons: (1) Prosper is a San Francisco, CA based company and (2) California is the home of Silicon Valley and therefore, it is likely that this state is more open-minded to new forms of lending.
Moreover, most borrowers on Prosper indicate to be Professional, Computer Programmer, Administrative Assistant, Executive, Teacher or Analyst (Note that the visualization above exludes those, who indicated “Other” as occupation). Especially, the large number of borrowers who are Executives is surprising to me, since these people tend to be quite creditworthy and earn the fourth highest monthly salary, which we will see later. However, they choose to borrow on Prosper instead of going the conventional way and borrow with their bank. Maybe Prosper offers attractive interest rates for this group of people? We will see this later.
Most borrowers indicate “Employed” and “Full-Time” as employment status.
The histograms above reveal further insights about the borrowers on Prosper. The average EmploymentStatusDuration is about 8 years and almost all borrowers indicate a number between 0 and 200 months. Surprisingly, there are also a few borrowers that have been in the same status for more than 500 months (~40 years).
Moreover, the average StatedMonthlyIncome is slightly above 5,000 USD, which is quite high. Looking at the histogram a majority of borrowers actually indicates a monthly income below 5,000 USD. It is important to note that I have also limited the x-Axis since some borrowers indicated a monthly income of more than 25.000 USD. Those are likely to be outliers.
The high number of unavailabe TotalProsperLoans indicates that most borrowers on Prosper use the platform for the first time, assuming unavailable data indicates no previous loans with Prosper. However, some borrowers have used the platform before.
Finally, most borrowers use the Prosper platform to consolidate their debts, to improve their homes and build their business.
In the following section I decided to have a closer look at the loans funded by the Prosper community. I am especially interested in the Returns for Investors, credit terms on Prosper and loan performance.
First, I calculated a new variable Actual Return, since there was no such information available in the data set. However, information such as loan amounts, customer payments, losses, and fees are available in the data set and therefore actual return can be calculated (Note: Actual returns were only calculated for closed loans). From the plot above one can see that the actual returns are distributed unevenly. Even though, the majority of returns is positive (roughly 50% of the actual returns lie between 0% and 25%), there are quite a few negative returns. There are even loans, where the actual loss is greater than 100%. Looking into the data, this loss exceeding the amount of the original loan is caused by fairly high collection and service fees on defaulted loans. Those loans usually have a 100% Principal loss and fees increase the loss for investors even above and beyond the principal.
Second, more than 50% of the loans are funded by at least 50 Investors, which clearly shows the crowd-sourcing character of the Prosper loans platform.
The LoanOriginalAmount plot indicates that borrowers usually ask for an amount less than 15.000 USD, even though there are also loans with much higher principals. Especially 5,000, 10,000 and 15,000 USD loans are very popular on Prosper.
Looking at the frequencies of loan status, Most loans in the data set are either completed or currently being paid back. However, there are about 19,000 non-performing loans in the data set, which equals roughly 15% of total observations. Later in this analysis I will look at some factors which potentially determine or predict loan performance.
The Prosper Platform was founded in 2005 and the number of listings has increased ever since. However, looking at the timeline above, the platform went through severe setbacks especially in 2008 and 2009 when the global financial crisis hit the banking sector. But Prosper managed to recover and exponentially increased its popularity up until 2014. When looking at this quarterly trend I got curious if there were some interesting patterns when looking at the number of listings created on a daily basis.
When examining the calendar visualisation above, one can immediately see that most listings in 2013 were created between September and October. Since we could see that from the quarterly visualisation already, it is more interesting to look at the horizontal patterns for weekdays. Having a close look one can see that for Sundays and Saturdays the coloring is lighter than for business days. Hence, it seems that borrowers on Prosper create their listings more often on business days instead of weekends.
More than 50% of borrowers have less than 250 USD of MonthlyLoanPayment. Moreover, most loans on Prosper have a duration of 3 years, but it is possible to get 12 and 60 months loans. The median interest rate on Prosper is about 19%, which is fairly high and probably much higher than for conventional loans. The median EstimatedReturn for investors calculated by Prosper since July 2009 is about 9%. This is a significant difference between interest rate and investor return, which represents the markup and costs of Prosper. Later in this analysis I will also compare actual return and estimated return for an appropriate period of time.
In order to assess the quality of Prosper loans, I not only want to look at loan performance but also at borrower ratings which are assigned ex ante and should help to identify risky investments.
Having a look at the two different credit ratings provided by the data set, I can readily see that both are quite normally distributed. The ProsperScore, which is calculated by the Prosper Platform on a discrete scale has an average of about 6. The CreditScore which is provided by an external consumer credit rating agency is measured at a continuous scale and the upper and lower bound only differ a little averaging at about 700.
Finally, I will look at some other variables which may help to identify risky loans on Prosper.
From the plots above, I can conclude that these variables are highly skewed and they are probably only useful for special cases where loans are not performing or likely to default in the near future. However, the plots show that only very few borrowers on Prosper have outstanding or late payments as well as only few delinquencies recorded in the last seven years. Moreover, the DebtToIncome Ratio for almost all borrowers on Prosper lies well below 50%, which is a reasonable ratio.
The data set at hand contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information. The dataset contains loans created between 2005-Q4 and 2014-Q1. In my analysis I focused on 24 variables.
There are a number of features in the data set which are interesting to explore. First, it will be interesting to examine loan performance in greater detail. Therefore, I will have a look at a number of variables, that might influence loan performance.
Second, actual returns are very important for investors and will eventually determine the success of the Prosper platform. The higher actual returns the more investors will be attracted and the more successful will Prosper become.
Third, it will be interesting to see how Prosper developed over time, since we could already see interesting patterns regarding the number of listings over the past years.
Regarding loan performance I expect that credit ratings, interest rates and borrower income will be good predictors of loan performance. Moreover, I expect that actual returns are also closely related to loan performance. Finally, I will look at the changes of loan performance and actual returns over time and I will try to identify reasons for Prosper’s success in recent years.
I calculated a new variable Actual Return, since there was no such information available in the data set. However, information such as loan amounts, customer payments, losses, and fees are available in the data set and therefore actual return can be calculated (Note: Actual returns were only calculated for closed loans). Additionally, I created a variable named Status which categorizes the loan status into performing and non-performing loans.
For the variables AmountDelinquent, DelinquenciesLast7Years, ProsperPaymentsOneMonthPlusLate and TotalProsperLoans I observed highly skewed and unusual distributions. This is the case because most loans in the data set have either no data recorded or a value of 0. For ActualReturns I observed an unusual distribution of negative returns, which is caused by defaulted loans on Prosper.
In order to display factor levels in chronological order I re-ordered default factor levels to make more intuitive sense when displayed in a bar chart. Moreover, I used dplyr to group, summarise and join data frames.
In the following section I will look at two variables at a time to reveal interesting interactions between variables.
In order to get a sense of correlations among the numeric variables in the data set I calculated a correlation matrix. To avoid distortions of my investigation, I considered the top and lowest 1% of the distributions of the following variables as outliers for calculating the correlations above: CreditScoreRangeUpper,CreditScoreRangeLower, AmountDelinquent,DelinquenciesLast7Years,DebtToIncomeRatio,StatedMonthlyIncome,TotalProsperLoans and ProsperPaymentsOneMonthPlusLate.
The visualisation above shows that only a few variables in the data set seem to be correlated (blue represents a positive relationship and red a negative). EmploymentStatusDuration,TotalProsperLoans,ProsperPaymentsOneMonthPlusLate and ActualReturn seem not to be correlated with any of the other numeric variables.
To get an broad overview, I created the scatter matrix above which includes a smoother for each pair of variable. For the correlated pairs of variables there seem to be some linear relationships, however, some relationships are non-linear. Moreover, the uncorrelated variables have pretty messy scatter plots, where no obvious relationship can be detected. To get a better understanding of the correlated variables I also plotted them separately.
Examining the bivariate relationships above, I can clearly identify a strong linear and positiv relationship between BorrowerRate and EstimatedReturn. This tells us a lot about how Prosper might calculate EstimatedReturn and the predictive power of this indicator for actual returns. A perfect linear relationship would suggests, that BorrowerRate is the only independent variable that determines EstimatedReturn. Even though there is some variation in the plot above which cannot be explained by the smoother (meaning that another variable may influence that variation), we will later see that this drastically changed over time. Moreover, we will see later in this analysis that returns for investors are not only dependent on BorrowerRate.
There is also a strong linear and positive relationship between LoanOriginalAmount and MonthlyLoanPayment. This makes totally sense, since a higher loan amount automatically causes more interest to be paid and principal amount to be repaid.
The remaining bivariate pairs of variables seem to relate to some extent. It seems that better credit ratings lead to a lower interest rate. Moreover, better credit ratings seem to allow borrowers to borrow a larger amount of money. Surprisingly, the external credit score and the Prosper score do not show a strong linear relationship. For example, there are many loans which receive bad external credit scores but very good Prosper scores. Later in this analysis, I will look into this in greater detail. Finally, since we can observe large variations not explained by the smoother, these more weakly correlated variables are likely to be influenced by more than one factor.
After having looked into EstimatedReturn and after having found that its calculation is likely to be rather simplistic and solely based on BorrowerRate, I got curious how it is related to actual returns. Therefore, I calculated a new variable which measures the difference between estimated return and actual return of a particular loan (where a negative difference indicates that actual returns were higher than estimated returns). From the plot above, one can see that for most loans the EstimatedReturn was less than the actual return (for about 75% of the loans). However, for about 25% of the loans Prosper over-estimated the return of the loan.
In the univariate analysis we have seen that Prosper experienced some major problems throughout the years but strongly increased the number of listings in recent years. The chart above might be one explanation for Prosper’s success. As you can see, actual returns were on average pretty bad for the pre 2009-Q2 period. But since the second quarter of 2009 actual returns for investors were on average constantly positive and around 10%, which is a fairly good return. Please note that the chart above only includes periods up until the first quarter of 2011, since after that date not all loans have had the chance to close in time (maximum term of 60 months on Prosper).
A possible explanation for increasing returns for investors may be the increasing quality of loans. The box plot above shows that median CreditScores constantly increased over the past years. Moreover, the box plot on CreditScores reveals what major policy change at Prosper in 2009 shifted the trend for ActualReturns. The lower whiskers, which were substantially higher from 2009 onwards, indicate that the minimum required creditScore to receive a loan on Prosper increased. This led higher actual return because fewer loans defaulted, which we will see from the plots below. Moreover, the monthly income of borrowers on Prosper did not change over the years, which confirms our findings from the correlation analysis above.
As I have expected above, the share of non-performing loans suddenly decreased in the second quarter of 2009, which can be explained by an increasing creditworthiness of borrowers on Prosper. This decrease in defaulting loans likely increased the improvement of returns for investors on Prosper. Moreover, from the line chart above I found that interest rates also increased from 2009-Q2 until 2011-Q1. This may also be an explanation for improving actual returns from 2009-Q2 onwards, as both variables correlate with each other.
So far we have seen that CreditScores correlate with LoanOriginalAmount and CreditScores of borrowers on Prosper have increased over the past years. Accordingly, we can observe an increase in the LoanOriginalAmount over the past years. Additionally, I was curious whether the DebtToIncomeRatio improved over the past years, since the credit ratings also improved. However, I found that the opposite is the case. The DebtToIncomeRatio increased over the past years, which suggests a weak relationship to CreditScores. But looking at the data this development actually is not as surprising as it seems. This is because we have seen that the borrower’s income has remained constant while loan amounts have increased, which must eventually lead to a worse DebtToIncomeRatio.
To summarize, we have seen that actual returns on Prosper improved most likely because of better credit ratings and fewer defaulting loans. However, we have not yet found reasons why credit ratings improved. So let’s rather look for reasons why loans default on Prosper.
First, the bar charts above indicate that 36 months loans have the highest proportion of non-performing loans. It is important to note that most loans on Prosper have a duration of 3 years, therefore it is not surprising that these loans also have the highest proportion of non-performing loans. Besides that it also makes sense that loans with longer duration are more likely to default since in 3 years time a lot can happen which worsens a borrower’s ability to pay back its debt.
Second, it is not surprising that EmploymentStatus has predictive power for loan performance. Borrowers who are in employment status “Self-employed” or “Retired” have the highest likelihood of defaulting.